Abstract

We address the problem of building an index for a set $D$ of $n$ strings,
where each string location is a subset of some finite integer alphabet of
size $\sigma$, so that we can efficiently answer whether a given simple query
string $p$ (where each string location is a single symbol) occurs in the set.
That is, we need to efficiently find a string $d \in D$ such that
$p[i] \in d[i]$ for every $i$. We show how to build such an index in
$O(n^{\log_{\sigma/\Delta}(\sigma)} \log(n))$ average time, where $\Delta$ is
the average size of the subsets. Our methods have applications e.g. in
computational biology (haplotype inference) and music information retrieval.

Keywords: algorithms; approximate string matching; subset matching; finite- 
state automaton minimization 



1 Introduction 

Let $\Sigma = \{0, \ldots, \sigma - 1\}$ be an ordered integer alphabet. We are
given a set $D = \{d_0, \ldots, d_{n-1}\}$ of strings, called a dictionary.
Each location $j$ of the string $d_i$ is a subset of $\Sigma$, i.e.
$d_i[j] \subseteq \Sigma$ for every $0 \le i \le n - 1$ and
$0 \le j \le |d_i| - 1$. A string $p$ is called simple if each of its locations
is a single symbol from $\Sigma$, i.e. $p[j] \in \Sigma$. The simple query
string $p$ matches the dictionary string $d_i \in D$ iff $p[j] \in d_i[j]$ for
$0 \le j \le |p| - 1$ and $|p| = |d_i|$. We consider the following two
problems:

Problem 1 Decide if p matches any string in D. 

Problem 2 Retrieve the set $L = \{j_1, \ldots, j_r\}$ such that $p$ matches
$d_{j_i}$ for all $j_i \in L$.

In particular, we set out to efficiently build a small index for $D$ such that
both problems can be solved in $O(|p|)$ time.
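For reference, the matching relation and the two problems can be written down as a brute-force scan over the dictionary. This is only a baseline sketch, not the index discussed in this paper; the function names and the representation of a dictionary string as a list of Python sets are assumptions made for illustration.

```python
def matches(p, d):
    """Return True if the simple string p matches the subset string d."""
    return len(p) == len(d) and all(c in s for c, s in zip(p, d))


def problem1(p, D):
    """Problem 1: does p match any dictionary string?"""
    return any(matches(p, d) for d in D)


def problem2(p, D):
    """Problem 2: identifiers of all dictionary strings matched by p."""
    return [i for i, d in enumerate(D) if matches(p, d)]


# Toy dictionary over the alphabet {0, 1, 2, 3}; each location is a set.
D = [
    [{0}, {1, 2}, {3}],
    [{0, 1}, {2}, {3}],
    [{2}, {2}],
]
print(problem1([0, 2, 3], D))  # True
print(problem2([0, 2, 3], D))  # [0, 1]
print(problem2([1, 1, 3], D))  # []
```

Each query costs time proportional to the total dictionary size here, which is what the index is meant to avoid.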

Efficient solutions to these problems have applications in computational
biology, in matching DNA ($\sigma = 4$) or protein ($\sigma = 20$) strings, or
in haplotype inference ($\sigma = 2$) [2] [10]. Finally, note that if
$|d_i[j]|$ is either 1 or $\sigma$ for all $i, j$, then we have a special case
called wild-card matching [3]. Another special case is $\delta$-matching (see
e.g. [5]), where we have
$d_i[j] = \{c_{ij} - \delta, \ldots, c_{ij} + \delta\}$, where
$c_{ij} \in \Sigma$ and $\delta < \sigma$. These variants have applications in
indexing natural language words and in music information retrieval.

1.1 Related work 

Assume that the longest string in $D$ has length $m$ and that for every
$d_i \in D$ there are at most $k$ locations $j$ where $|d_i[j]| > 1$. The
immediate trivial solution to our problem would then be as follows. First
generate all the simple strings of length at most $m$ that match a string in
$D$. Call the set of these strings $D'$. The size of $D'$ is upper bounded by
$O(n\sigma^k)$. The problem is now transformed to exact matching, so we can
insert all strings in $D'$ into some data structure that can answer whether a
given simple query string matches a string in the data structure exactly. One
such data structure is a path compressed trie [7] (cf. Sec. 2). This can be
naively built in $O(m|D'|) = O(mn\sigma^k)$ time and space. The queries can be
answered in $O(|p|)$ time.
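The trivial solution can be sketched as follows. As a simplifying assumption, a Python dict stands in for the path compressed trie, and the name `expand` is hypothetical; the point is only to show how $D'$ is enumerated and why its size can blow up.

```python
from itertools import product


def expand(D):
    """Map every simple string in D' to the ids of the strings it matches."""
    index = {}
    for i, d in enumerate(D):
        # One simple string per choice of a symbol from each location.
        for simple in product(*(sorted(s) for s in d)):
            index.setdefault(simple, []).append(i)
    return index


D = [
    [{0}, {1, 2}, {3}],
    [{0, 1}, {2}, {3}],
]
index = expand(D)
print(len(index))            # 3
print(index.get((0, 2, 3)))  # [0, 1]
print(index.get((3, 3, 3)))  # None
```

Every location with more than one symbol multiplies the number of generated strings, which is the source of the $\sigma^k$ factor in the bound.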

This is also the approach in [10]. They give two non-trivial algorithms to
construct the (path compressed) trie faster, namely in
$O(nm + \sigma^k n \log(\min\{n, m\}))$ and
$O(nm + \sigma^k n + \sigma^{k/2} n \log(\min\{n, m\}))$ time, yielding query
times of $O(|p|)$ and
$O(|p| \log\log(\sigma) + \min\{|p|, \log(\sigma^k n)\} \log\log(\sigma^k n))$,
respectively (the latter method in fact uses two tries).

Techniques from earlier work can be adapted [10] to solve the problem with
$O(nm \log(nm) + n \log^k(n)/k!)$ preprocessing time, and
$O(m + \log^k(n) \log\log(n))$ query time.

1.2 Our contributions 

Inspired by [10], we also take the approach of computing the trie for $D'$ as a
starting point. However, instead of a trie, we directly build a pseudo-minimal
(cf. Sec. 2.2) deterministic finite-state automaton (DFA) corresponding to the
set $D'$; i.e. our method does not explicitly generate the set $D'$. The
resulting automaton can be used to solve Problems 1 and 2 in $O(|p|)$ time.
This automaton can be easily and efficiently minimized (again, cf. Sec. 2.2),
so that Problem 1 can still be solved in $O(|p|)$ time. We also propose a form
of path compression that can further save space and speed up the construction.
We show that our construction works in
$O(n^{\log_{\sigma/\Delta}(\sigma)} \log(n))$ average time, where
$\Delta = \mathrm{avg}\, |d_i[j]|$.

As shown experimentally, our algorithm can be orders of magnitude faster in
construction time than the related naive approach of first building a trie for
$D'$ and then converting it to the minimal DFA, or directly building the
minimal DFA from $D'$. The pseudo-minimal automaton is more efficient to
construct than the true minimal automaton, and is in practice only slightly
larger.

2 The algorithm



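Let us define a DFA as $M(Q, \Sigma, \delta, q, F)$, where $Q$ is the set of
states, $q$ is the initial state, $F \subseteq Q$ is the set of accepting
states and $\delta : Q \times \Sigma \to Q$ is the transition function. For
convenience we also define $\delta^*(q, aw) = \delta^*(\delta(q, a), w)$ for a
symbol $a \in \Sigma$ and a string $w \in \Sigma^*$, with
$\delta^*(q, \varepsilon) = q$ for the empty string $\varepsilon$.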

2.1 Prelude 

Traditionally a trie [7] is described as being a rooted tree storing a set of
(simple) strings. Each node has at most $\sigma$ children, and the (directed)
edges are labeled by the symbols in $\Sigma$. In a path compressed trie the
unary paths are compacted to single edges, labeled by strings consisting of the
concatenation of the symbols in the original path. In both cases, a path from
the root to any node $u$ spells out a prefix of a subset of the strings stored
in the trie, and that subset is stored in the subtree rooted at $u$. The trie
can be seen as a DFA in an obvious way: the root node corresponds to the state
$q$, and the labeled edges correspond to $\delta$.

We extend the DFA so that for the nodes $u \in F$ we attach a list $L$, storing
the corresponding string identifiers. More formally, we define

$j_i \in L(u) \iff \bar{u} \text{ matches } d_{j_i} \in D$, (1)

where $\bar{u}$ denotes the string spelled by the path from $q$ to $u$, i.e.
$\bar{u} = \{w \mid \delta^*(q, w) = u\}$. Thus by generating all the strings
$D'$ that match a string in $D$ and building a DFA for $D'$, Problems 1 and 2
can be solved in $O(|p|)$ time.

One of the problems of this approach is that \D'\ can be large. A way to 
alleviate this is to minimize the DFA. There exists a large number of algorithms 
for this task [4] . Some of these can build the automaton incrementally, inserting 
one string at a time while maintaining the automaton in minimal state (e.g. [B]). 

This can still be unnecessarily slow. Moreover, the result does not allow 
proper mapping between the states and the lists L. E.g. if all the strings in D are 
of equal length, the resulting minimal DFA would have only one accepting state. 
However, this automaton can still be used to solve Problem 1. Another solution 
is to construct a pseudo-minimal DFA [11] [S] still allowing mapping states or
transitions to strings. We take a similar approach, although our definition of 
pseudo-minimal is somewhat different. 

2.2 Pseudo-minimal DFA 

We now present an algorithm that directly (i.e. our algorithm never deletes a
state) constructs a pseudo-minimal DFA from $D$, without using a trie-like DFA
as an intermediate step, or explicitly generating the set $D'$. Nevertheless,
we first describe a particular (direct) way to build a trie-DFA, and then
define a certain equivalence relation for the trie states, and show how we can
avoid creating new states during the construction by identifying an equivalent
state already present.

The algorithm can proceed recursively in either a depth-first or a
breadth-first manner, with minor differences. We describe and give pseudo code
for the breadth-first variant: the construction begins by inserting the
starting state (root node) into a queue of states; at each stage a state is
dequeued and its children are computed and enqueued. The algorithm terminates
when the queue becomes empty. As described above, each state $u$ will have an
associated list $L(u)$ ($L(u) = \emptyset$ if $u \notin F$). We will denote the
partially computed list as $L'(u)$ ($L'(u) \neq \emptyset$). The following
invariants are maintained: (a) when all the children (if any) of $u$ are
enqueued, the state $u$ is fully computed and Eq. (1) is satisfied
(post-condition); (b) when a state $u$ is enqueued, then the list $L'(u)$
satisfies Eq. (2) below (pre-condition):

$j_i \in L'(u) \iff \bar{u} \text{ matches } d_{j_i}[0 \ldots |\bar{u}| - 1] \mid d_{j_i} \in D$. (2)

I.e. $j_i \in L'(u)$ iff $\bar{u}$ matches a prefix of $d_{j_i}$ (note that
$|\bar{u}| = depth(u)$, if the paths are not compressed). Thus the algorithm
initializes

$L'(q) = \{0, \ldots, n - 1\}$ (3)

and enqueues $q$. At each iteration, one state $u$ is dequeued, its "children"
are initialized according to the pre-condition and enqueued, and the
post-condition for $u$ is computed. Given the list $L'(u)$ and
$\forall c \in \Sigma$, we define

$L'(v) = \{j_i \mid j_i \in L'(u) \text{ AND } c \in d_{j_i}[|\bar{u}|]\}$. (4)

If $|L'(v)| > 0$, then a transition $\delta(u, c) = v$ is added, and $v$ is
enqueued. Note that $j_i$ is put into $|d_{j_i}[|\bar{u}|]|$ lists. The list
$L(u)$ is then computed as

$L(u) = \{j_i \mid j_i \in L'(u) \text{ AND } |\bar{u}| = |d_{j_i}|\}$. (5)

That is, we keep only the strings that end in $u$, and $u$ becomes an accepting
state iff $|L(u)| > 0$. All the $\sigma$ lists $L'(v)$ and the list $L(u)$ can
be computed with a single pass over the list $L'(u)$. Alg. 1 gives the pseudo
code.

This is repeated until the queue becomes empty. Note that this computes 
exactly the same trie as one would get by first generating D' and then inserting 
the strings one at a time. However, our bulk-insertion method is more easily 
improved. 

We define the following relation between the states $u$ and $v$:

$u \equiv_p v : L'(u) = L'(v) \text{ AND } |\bar{u}| = |\bar{v}|$, (6)

which is clearly reflexive, symmetric, and transitive, i.e. an equivalence
relation. The following is easy to notice:

$u \equiv_p v \Rightarrow \mathcal{L}(u) = \mathcal{L}(v)$, (7)

where the language of $u$ is

$\mathcal{L}(u) = \{w \in \Sigma^* \mid \delta^*(u, w) \in F\}$. (8)
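
One way to see why the implication (7) holds is sketched below, assuming uncompressed paths and writing $h = |\bar{u}| = |\bar{v}|$. The chain relies only on the pre-condition (2): which suffixes are accepted from a state depends on nothing but its list $L'$ and its depth.

```latex
\begin{align*}
w \in \mathcal{L}(u)
  &\iff \exists j \in L'(u) :\ |d_j| = h + |w|
        \ \text{and}\ w[t] \in d_j[h + t]\ \text{for } 0 \le t < |w| \\
  &\iff \exists j \in L'(v) :\ |d_j| = h + |w|
        \ \text{and}\ w[t] \in d_j[h + t]\ \text{for } 0 \le t < |w| \\
  &\iff w \in \mathcal{L}(v).
\end{align*}
```

The middle step uses $L'(u) = L'(v)$ together with the equal depths, which is exactly the definition (6).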

Alg. 1 Partition(D, L', depth).

    L ← ∅
    for c ← 0 to σ - 1 do P[c] ← ∅
    for i ← 0 to |L'| - 1 do
        k ← L'[i]
        if |d_k| ≤ depth then
            L ← L ∪ {k}
        else
            for ∀c ∈ d_k[depth] do P[c] ← P[c] ∪ {k}
    return (L, P)
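
The partition step translates almost line by line into runnable code. The following is a sketch under assumptions, not the author's C implementation: dictionary strings are lists of Python sets, the alphabet size `sigma` is passed explicitly, and lists are kept as ordinary Python lists.

```python
def partition(D, Lp, depth, sigma):
    """One pass over the list Lp = L'(u) for a state u at the given depth.

    Returns (L, P): L holds the ids of strings that end at u, and P[c] holds
    the ids of strings whose subset at position `depth` contains symbol c.
    """
    L = []
    P = [[] for _ in range(sigma)]
    for k in Lp:
        if len(D[k]) <= depth:
            L.append(k)            # string k ends in this state
        else:
            for c in D[k][depth]:  # k goes into one list per symbol
                P[c].append(k)
    return L, P


D = [
    [{0}, {1, 2}],
    [{0, 1}],
    [{1}, {2}],
]
L, P = partition(D, [0, 1, 2], 0, sigma=3)
print(L)  # []
print(P)  # [[0, 1], [1, 2], []]
L, P = partition(D, [1, 2], 1, sigma=3)
print(L)  # [1]
print(P)  # [[], [], [2]]
```

Because `Lp` is scanned in order, every output list stays sorted whenever the input list is sorted, which makes the lists directly usable as search keys.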



Hence we will partition the states into equivalence classes, so that in the
final DFA all states belong to different classes. Note that this does not
result in a minimal DFA; i.e. we have that
$\mathcal{L}(u) = \mathcal{L}(v) \not\Rightarrow u \equiv_p v$, while the
implication would be required for a true minimal automaton. Note that by the
definition we can still properly associate states with the lists $L'$ and $L$.
So we can call the result a pseudo-minimal DFA as in [11] [S], although our
definition should not be confused with the definition given in these papers.

We need to maintain sets of pairs $(L', u)$, where $L'$ is a key that is used
to insert and search the state $u$, a representative of its equivalence class.
The algorithm is now immediate: whenever we have computed a list $L'(v)$, we
search whether it is present in a set $S(depth(v))$; if so, $v$ can be replaced
by the corresponding node $u$. In this case, $v$ is not enqueued, as an
equivalent state $u$ is in the queue already. If $L'(v)$ is not present, we
insert $(L'(v), v)$ into $S(depth(v))$, and enqueue $v$. Alg. 2 gives the
complete pseudo code, keeping the automaton in its pseudo-minimal state
throughout the construction.
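
The construction described above can be sketched in a few dozen lines. This is an illustrative sketch only: a Python dict keyed by (depth, list) stands in for the per-depth sets $S$, states are plain integers, and the names `build_dfa` and `query` are hypothetical.

```python
from collections import deque


def build_dfa(D, sigma):
    """Breadth-first construction, merging states with equal (depth, L') keys."""
    delta = [{}]                    # delta[u][c] = v
    out = [[]]                      # out[u] = L(u), ids of strings ending at u
    lists = [tuple(range(len(D)))]  # lists[u] = L'(u)
    depth = [0]
    seen = {}                       # (depth, L') -> representative state
    queue = deque([0])
    while queue:
        u = queue.popleft()
        h = depth[u]
        P = [[] for _ in range(sigma)]
        for k in lists[u]:          # the partition step of Alg. 1
            if len(D[k]) <= h:
                out[u].append(k)
            else:
                for c in D[k][h]:
                    P[c].append(k)
        for c in range(sigma):
            if not P[c]:
                continue
            key = (h + 1, tuple(P[c]))
            v = seen.get(key)
            if v is None:           # no equivalent state yet: create one
                v = len(delta)
                delta.append({})
                out.append([])
                lists.append(key[1])
                depth.append(h + 1)
                seen[key] = v
                queue.append(v)
            delta[u][c] = v
    return delta, out


def query(delta, out, p):
    """Follow p from the initial state 0; return the matching ids (Problem 2)."""
    u = 0
    for c in p:
        u = delta[u].get(c)
        if u is None:
            return []
    return out[u]


D = [
    [{0, 1}, {0, 1}],
    [{0, 1}, {1}],
]
delta, out = build_dfa(D, sigma=2)
print(len(delta))                 # 4
print(query(delta, out, [1, 0]))  # [0]
print(query(delta, out, [0, 1]))  # [0, 1]
```

On this toy input the automaton has 4 states, whereas a plain trie for the same $D'$ would have 7 nodes, since both symbols at the first position lead to the same list and therefore to the same state.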

2.3 Using subsets for unary paths 

For a moment consider a plain trie with path compression. In this case the trie
has $O(|D'|)$ nodes (states), independent of the pattern lengths (without path
compression, this is multiplied by $O(m)$). While this may save space in many
cases, this is not always so. Consider e.g. the unrealistically pathological
case where $D$ contains only one string of length $m$, namely $\Sigma^m$. This
means that all $\sigma^m$ possible strings are present in $D'$, and no path
compression can take place, as there simply are no unary paths (the minimal and
pseudo-minimal DFAs would both have only $m + 1$ states). We propose a slightly
different, but much more effective, path compression.

Alg. 2 BuildDFA(D).

     1  q ← NewState()
     2  L'(q) ← {0, ..., |D| - 1}
     3  Enqueue(q)
     4  while NOT QueueEmpty() do
     5      u ← Dequeue()
     6      (L(u), P) ← Partition(D, L'(u), depth(u))
     7      if |L(u)| > 0 then F ← F ∪ {u}
     8      for c ← 0 to σ - 1 do
     9          if |P[c]| = 0 then continue
    10          v ← Search(S[depth(u)], P[c])
    11          if v = NULL then
    12              v ← NewState()
    13              L'(v) ← P[c]
    14              Insert(S[depth(u)], (L'(v), v))
    15              Enqueue(v)
    16          δ(u, c) ← v
    17  return q

Consider now a string in $D$, and in particular that the string positions can
be any subsets of $\Sigma$ (not necessarily just single symbols). Assume that
$d_i[depth(u)] = d_j[depth(u)]$, for some $u$ and $\forall i, j \in L'(u)$.
This means that there is no need to branch, since all the subsets are the same,
and no symbol in $\Sigma$ can differentiate between any $d_i$, $d_j$. Hence we
could add a transition from $u$ to (some) $v$ using the subset $d_i[depth(u)]$
as a label. This does not pose any problems, as (when used in recognition) we
can still test in $O(1)$ time if $p[depth(u)] \in d_i[depth(u)]$. (Note that
our pseudo-minimization algorithm effectively already handles this, i.e. under
the above condition, $\delta(u, c) = v$ for $\forall c \in d_i[depth(u)]$.)
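
A constant-time membership test on a subset label can be obtained, for example, by storing the label as a bitmask. This is one possible encoding assumed here for illustration (it requires $\sigma$ to fit in a machine word in a C implementation); the helper names are hypothetical.

```python
def to_mask(subset):
    """Encode a subset of {0, ..., sigma-1} as an integer bitmask."""
    mask = 0
    for c in subset:
        mask |= 1 << c
    return mask


def follow(label_mask, target, c):
    """Return the target state if symbol c is in the edge label, else None."""
    return target if (label_mask >> c) & 1 else None


label = to_mask({1, 3})     # a single edge labelled with the subset {1, 3}
print(follow(label, 7, 3))  # 7
print(follow(label, 7, 2))  # None
```

A compressed unary path then becomes a sequence of such masks, checked one query symbol at a time.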

More generally, given a node $u$ and a position $h$ such that

$\forall i, j \in L'(u) : d_i[k] = d_j[k] \mid depth(u) \le k < h$, (9)

then $d_i[depth(u) \ldots h - 1]$ can be used as a string label in a compressed
unary path.

The easiest way to utilize this is to use it only for unary paths to the leaves
when $|L'(u)| = 1$. This is effectively achieved simply by replacing line 15
in Alg. 2 by "if $|L'(v)| > 1$ then Enqueue($v$)". It would be relatively easy
to use the path compression in any unary path, but as shown in Sec. 3 this
simple method can give huge savings in both time and space.

2.4 Analysis

Let us now consider the running time of Alg. 2 with (our) path compression on
leaves. We assume that the subsets $d_i[j]$ have average size $\Delta$, and
that they are randomly, uniformly and independently generated. At first we
assume that there is a non-zero probability that two random subsets do not
intersect (e.g. $\Delta < \sigma/2$).

The partition of $L'(u)$ can be implemented to take $O(|L'(u)|\Delta)$ time.
Each of the $\sigma$ resulting new sets has average size
$O(|L'(u)|\Delta/\sigma)$, as for a random $c \in \Sigma$ the probability that
$c \in d_i[j]$ is $\Delta/\sigma$. These sets are searched from
$S$, and possibly inserted (if not found). The size of $S$ is upper bounded by
$O(|Q|)$, the number of states in the resulting automaton. Hence insert/search
can be implemented in $O(\log(|Q|) + |L'(u)|\Delta/\sigma)$ worst case time
with a number of radix-tree techniques, see e.g. [10] [1]. Therefore the total
time per node is
$O(\sigma(\log(|Q|) + |L'(u)|\Delta/\sigma) + |L'(u)|\Delta)$, i.e.
$O(\sigma\log(|Q|) + |L'(u)|\Delta)$, which is $O(\log(|Q|) + |L'(u)|)$,
assuming $\sigma = O(1)$.

For a moment assume that we are building a plain trie, without path
compression. Recall that by definition the length of the list $L'(root)$ is
exactly $n$. As described above, the length¹ of each of the $\sigma$ lists for
the children of node $u$ is $O(|L'(u)|\Delta/\sigma)$, so the lengths of the
lists $L'(u)$ decrease exponentially when the depth of $u$ (i.e. $|\bar{u}|$)
increases, as $|L'(u)| = O((\Delta/\sigma)^{|\bar{u}|} n)$. Hence
$|L'(u)| = O(1)$ when $\alpha = |\bar{u}| \ge \log_{\sigma/\Delta}(n)$. The
total number of states up to this depth is
$|Q| = \sum_{i=0}^{\alpha} \sigma^i = O(\sigma^\alpha) = O(n^{\log_{\sigma/\Delta}(\sigma)})$,
that is, all the states have all the $\sigma$ possible branches up to depth
$\alpha$. As there are $\sigma^i$ nodes at depth $i$, the total length of all
the lists at a depth $i$ is on average
$O((\Delta/\sigma)^i n \sigma^i) = O(\Delta^i n)$. Thus the total length of all
the lists up to depth $\alpha$ is
$\ell = n \sum_{i=0}^{\alpha} \Delta^i = O(n\Delta^\alpha) = O(n^{\log_{\sigma/\Delta}(\Delta)+1}) = O(n^{\log_{\sigma/\Delta}(\sigma)})$.
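
The quantities in this argument are easy to evaluate numerically. The snippet below is a small sanity check with illustrative parameter values that are assumptions, not taken from the experiments; it computes the critical depth $\alpha$ and verifies the identity $\sigma^\alpha = n^{\log_{\sigma/\Delta}(\sigma)}$ used above.

```python
import math


def critical_depth(n, sigma, avg_size):
    """alpha = log_{sigma/avg_size}(n): depth at which lists shrink to O(1)."""
    return math.log(n) / math.log(sigma / avg_size)


def state_exponent(sigma, avg_size):
    """e = log_{sigma/avg_size}(sigma), so that sigma**alpha == n**e."""
    return math.log(sigma) / math.log(sigma / avg_size)


n, sigma, avg_size = 10000, 4, 2.0  # illustrative values only
alpha = critical_depth(n, sigma, avg_size)
e = state_exponent(sigma, avg_size)
print(round(alpha, 2))                       # 13.29
print(round(e, 2))                           # 2.0
print(math.isclose(sigma ** alpha, n ** e))  # True
```

The exponent grows quickly as the average subset size approaches $\sigma$, which is why the case $\Delta > \sigma/2$ is treated separately.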

Assume now (pessimistically) that path compression and pseudo-minimization take
place only after depth $\alpha$. After this depth, the lists have length
$k = O(1)$ (and will continue to shrink until $k = 1$). There are only
$\binom{n}{k} = O(n^k/k!)$ different lists of length $k$, but at the same time
there are $O(n^{\log_{\sigma/\Delta}(\sigma)})$ states (with associated lists),
so by the pigeonhole principle many of the states must be equivalent, and are
combined into a single state. However, due to path compression, the process
terminates for any state having $k = 1$. Hence the number of states per level
starts to decrease exponentially² after depth $\alpha$. That is, the total
number of states is bounded by two geometric series, both having the largest
term at depth $\alpha$, where the automaton is at its "widest", i.e. the total
number of states is asymptotically upper bounded by
$O(n^{\log_{\sigma/\Delta}(\sigma)})$.

Summing up, the total time is on average

$O(|Q| \log(|Q|) + \ell) = O(n^{\log_{\sigma/\Delta}(\sigma)} \log(n))$, (10)

where $\ell$ is the total length of all the lists, again assuming
$\sigma = O(1)$.

So far we have assumed that there is a non-zero probability that two random
subsets do not intersect. Consider now the (rather uninteresting) case where
the subset sizes are always $\Delta > \sigma/2$ (not just on average). At
first, the process goes as before, the number of states increasing
exponentially, and the list lengths $|L'(u)|$ decreasing exponentially.
However, assume now, for simplicity, that $L'(u) = \{i, j\}$ for some state
$u$. Due to $\Delta > \sigma/2$, the subsets $d_i[h]$ and $d_j[h]$ must
intersect (where $h = |\bar{u}|$). Thus the alphabet $\Sigma$ is effectively
partitioned into four disjoint sets: $A = d_i[h] \setminus d_j[h]$;
$B = d_j[h] \setminus d_i[h]$; $C = d_i[h] \cap d_j[h]$;
$D = \Sigma \setminus (d_j[h] \cup d_i[h])$. Group $D$ does not generate any
branches for $u$. Symbols from $A$, $B$ and $C$ generate branches, but these
are combined (group-wise) by the minimization, resulting in at most one new
state per group, call it $v$. For $A$ (similarly for $B$), $|L'(v)| = 1$, and
due to path compression, $v$ will have no descendants. The interesting case is
$C$. Note that $C$ cannot be empty, so $L'(v) = L'(u)$, and hence the process
repeats for $v$. In other words, the process does not terminate until
$h = |d_i|$.

¹ In the "worst case" there is only one "new" set, being exactly the same as
its parent, $L'(u)$; but in this case the corresponding node would not branch,
so the complexity would only improve.

² Note that without combining the equivalent states or the path compression,
after depth $\alpha$ the number of states would continue to increase
exponentially, resulting in a full trie.
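
The four groups can be made concrete with a small example. The sketch below uses Python sets and a hypothetical helper name; the last group is called `rest` in the code to avoid confusion with the dictionary $D$.

```python
def split_alphabet(sigma, di_h, dj_h):
    """Partition {0, ..., sigma-1} into the groups A, B, C, D of the text."""
    alphabet = set(range(sigma))
    A = di_h - dj_h                  # only string i continues
    B = dj_h - di_h                  # only string j continues
    C = di_h & dj_h                  # both continue: L'(v) = L'(u)
    rest = alphabet - (di_h | dj_h)  # group D: no branch at all
    return A, B, C, rest


sigma = 4
A, B, C, rest = split_alphabet(sigma, {0, 1, 2}, {1, 2, 3})
print(A, B, C, rest)  # {0} {3} {1, 2} set()
```

Both subsets have size 3, which exceeds $\sigma/2 = 2$, so by counting $|C| \ge 3 + 3 - 4 = 2$ and the shared branch cannot disappear.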

The situation is similar when $|L'(u)| > 2$. Note that after depth $\alpha$ we
have $|L'(u)| = O(\Delta)$ in any case, and because of the pseudo-minimization,
there can be only $\binom{n}{\Delta} = O(n^\Delta/\Delta!)$ different states
with lists of length $O(\Delta)$. Thus in general the "breadth" of the
automaton will stay approximately the same after depth $\alpha$, and the total
time is upper bounded by
$O((n^{\log_{\sigma/\Delta}(\sigma)} + \min\{n^{\log_{\sigma/\Delta}(\sigma)}, n^\Delta\}\, m) \log(n))$,
where $m$ is the length of the strings in $D$.

Finally, as the number of subsets of $n$ items is at most $2^n$, the trivial
upper bound for the worst case size of our data structure is $O(m2^n)$. This
should be contrasted with the $O(\sigma^m)$ bound of [10].

3 Experiments and final remarks 

We have implemented the algorithms in C, and ran the experiments on a 3.0GHz
Intel Core2 with 2GB RAM, 4MB L2 cache, running GNU/Linux 2.6.23.

The implemented algorithms are: pseudo-minimal DFA (PM DFA), as in Alg. 2;
minimal DFA (M DFA); PM DFA with path compression on leaves (PM DFA PC), as
detailed in Sec. 2.3; plain trie; and trie with path compression on leaves
(Trie PC), as in PM DFA PC. Some results for the tries are not included, as
they could not fit into the available RAM. M DFA was computed from PM DFA, as
computing it from $D'$ or the corresponding trie would have been totally
intractable in most cases. We implemented the set $S$ in Alg. 2 with Patricia
tries [12].

We have not implemented the methods in [10], but we show that the lower bound
($|D'|$) for the size of their data structure can be several orders of
magnitude larger than our empirical sizes. In fact, we can build reasonably
small data structures for problem instances that are completely intractable
with their methods.

Table 1 gives the results for some randomly generated instances. We used
parameters $(m, n, \sigma, (\Delta_l, \Delta_h), f)$, where $m$ is the length
of the strings (all $n$ of equal length); $(\Delta_l, \Delta_h), f$ denotes
that with probability $f$ any string location contains a randomly selected
subset of $\Sigma$, where the size of the subset is randomly selected between
$\Delta_l \ldots \Delta_h$; otherwise (with probability $1 - f$) the string
location is a single random symbol from $\Sigma$.
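
An instance generator following this description could look as follows. It is a sketch under assumptions: the subset size is drawn uniformly from the inclusive range, the function name and the `seed` parameter are hypothetical, and the original C generator may differ in its details.

```python
import random


def random_instance(m, n, sigma, d_lo, d_hi, f, seed=0):
    """Generate n subset strings of length m over {0, ..., sigma-1}."""
    rng = random.Random(seed)
    D = []
    for _ in range(n):
        d = []
        for _ in range(m):
            if rng.random() < f:
                size = rng.randint(d_lo, d_hi)    # inclusive bounds
                d.append(set(rng.sample(range(sigma), size)))
            else:
                d.append({rng.randrange(sigma)})  # a single random symbol
        D.append(d)
    return D


D = random_instance(m=8, n=5, sigma=4, d_lo=2, d_hi=4, f=0.3, seed=1)
print(len(D), len(D[0]))  # 5 8
```

Fixing the seed makes an instance reproducible, which helps when comparing state counts across methods.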

We report the number of states generated by the different methods, as well as
the time in seconds, for some illustrative cases. As shown, the number of
states generated is significantly smaller than $|D'|$ in all cases, sometimes
the difference being many orders of magnitude. PM DFA is usually only slightly
larger than the true minimal DFA, while PM DFA with path compression is usually
smaller than M DFA. In some rare cases using path compression with a plain trie
is very competitive. Fig. 1 shows the exponential increase (depth
$\le \alpha$) and decrease (depth $\ge \alpha$) of the number of states as a
function of the depth in the automaton / trie, and illustrates the behaviour
when all subset sizes are $> \sigma/2$.

Table 1: Experimental results for data generated using parameters
$(m, |D|, \sigma, (\Delta_l, \Delta_h), f)$. Times are given in seconds. $|Q|$
is the number of generated states, and $|D'|$ is the number of different
strings matching a string in $D$.

    (32, 10000, 2, (2, 2), 0.2), |D'| = 3,418,449
    Method        time (s)           |Q|
    PM DFA           0.828       476,365
    M DFA            1.20        385,255
    PM DFA PC        0.820       379,948
    Trie             6.71     18,767,894
    Trie PC          0.326       948,493

    (32, 10000, 4, (2, 4), 0.3), |D'| = 40,755,624,312
    Method        time (s)           |Q|
    PM DFA           1.42        680,906
    M DFA            2.39        635,795
    PM DFA PC        1.40        499,212
    Trie PC          1.19      4,203,673

    (16, 10000, 20, (2, 6), 0.75), |D'| = 1,830,872,526,457
    Method        time (s)           |Q|
    PM DFA           6.50      1,335,251
    M DFA           18.2       1,320,126
    PM DFA PC        6.21      1,276,985
    Trie PC          6.29     22,431,630

    (16, 100000, 32, (2, 32), 0.01), |D'| = 1,033,039
    Method        time (s)           |Q|
    PM DFA           6.28      1,331,241
    M DFA           12.0         964,847
    PM DFA PC        0.486       149,998
    Trie            15.5       6,981,214
    Trie PC          0.236       235,565

    (16, 1000, 32, (32, 32), 0.25), |D'| ≈ 1,190 × 10^[exponent illegible]
    Method        time (s)           |Q|
    PM DFA           0.954       118,474
    M DFA            7.29        115,797
    PM DFA PC        0.907       110,340

Figure 1: The total number of states generated for each depth of the automaton
/ trie during the construction. Above: $\alpha = \log_{\sigma/\Delta}(n) \approx 10.04$;
middle: $\alpha \approx 5.07$. Below: subset sizes always $> \sigma/2$.



Finally, we note that our methods have applications in on-line dictionary
string matching, e.g. in $\delta$-matching and $(\delta, \gamma)$-matching. It
turns out that we can solve both problems in
$O(|T| \log_{\sigma/\delta}(nm)/m)$ average time, which is optimal for
$\delta$-matching [S], for a dictionary of $n$ patterns of length $m$, and a
text of length $|T|$. We leave the details for future work.